Gram-CTC: Automatic Unit Selection and Target Decomposition for Sequence Labelling
Abstract
Most existing sequence labelling models rely on a fixed decomposition of a target sequence into a sequence of basic units. These methods suffer from two major drawbacks: 1) the set of basic units is fixed, such as the set of words, characters or phonemes in speech recognition, and 2) the decomposition of target sequences is fixed. These drawbacks usually lead to sub-optimal sequence modeling performance. In this paper, we extend the popular CTC loss criterion to alleviate these limitations, and propose a new loss function called Gram-CTC. While preserving the advantages of CTC, Gram-CTC automatically learns the best set of basic units (grams), as well as the most suitable decomposition of target sequences. Unlike CTC, Gram-CTC allows the model to output a variable number of characters at each time step, which enables the model to capture longer-term dependencies and improves computational efficiency. We demonstrate that the proposed Gram-CTC improves on CTC in terms of both performance and efficiency on the large vocabulary speech recognition task at multiple scales of data, and that with Gram-CTC we can outperform the state-of-the-art on a standard speech benchmark.
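To make the idea of "decomposition of target sequences" concrete, the following minimal sketch enumerates every way a target string can be split into grams drawn from a given gram set. It is only an illustration under stated assumptions: the function name `decompositions` and the gram set `GRAMS` are hypothetical, and the brute-force enumeration is not the training procedure of the paper, which the abstract does not describe.

```python
from typing import List, Set


def decompositions(target: str, grams: Set[str]) -> List[List[str]]:
    """Return every split of `target` into consecutive grams from `grams`."""
    if not target:
        # The empty string has exactly one decomposition: the empty one.
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(target) + 1):
        prefix = target[:end]
        if prefix in grams:
            # Extend each decomposition of the remainder with this prefix.
            for rest in decompositions(target[end:], grams):
                results.append([prefix] + rest)
    return results


# Hypothetical gram set (an assumption for illustration only):
# the unigrams of "hello" plus a few bigrams.
GRAMS = {"h", "e", "l", "o", "he", "ll", "lo"}

if __name__ == "__main__":
    for d in decompositions("hello", GRAMS):
        print(d)
```

With a character-only gram set there is exactly one decomposition, which corresponds to the fixed decomposition the abstract criticises; adding multi-character grams opens up alternatives such as `['he', 'll', 'o']`, which cover the same target in fewer units.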
Similar resources
Multisyn: Open-domain unit selection for the Festival speech synthesis system
We present the implementation and evaluation of an open-domain unit selection speech synthesis engine designed to be flexible enough to encourage further unit selection research and allow rapid voice development by users with minimal speech synthesis knowledge and experience. We address the issues of automatically processing speech data into a usable voice using automatic segmentation technique...
Unit selection synthesis database development using utterance verification
Accurate annotation of the unit inventory database is of vital importance to the quality of unit selection text-to-speech synthesis. The time consuming manual work involved in database development limits the ability to produce new voices quickly and at low cost. Automatic annotation is therefore more and more in use. Misalignments due to mismatch between the predicted and pronounced unit sequen...
Comparison of Decoding Strategies for CTC Acoustic Models
Connectionist Temporal Classification (CTC) has recently attracted a lot of interest as it offers an elegant approach to building acoustic models (AMs) for speech recognition. The CTC loss function maps an input sequence of observable feature vectors to an output sequence of symbols. Output symbols are conditionally independent of each other under CTC loss, so a language model (LM) can be incorporate...
Automatic Prosody Labelling of read Norwegian
In this paper we present initial work on a method for automatic stress and boundary labelling of read East Norwegian. The context of this work is automatic corpus annotation for unit selection speech synthesis. A phonological model of Norwegian prosody is described. The identification of syllable stress and major intonational boundaries are key prosodic events for building a prosodic description...
Join Cost for Unit Selection Speech Synthesis
In unit-selection speech synthesis systems, synthetic speech is produced by concatenating speech units selected from a large database, or inventory, which contains many instances of each speech unit with varied prosodic and spectral characteristics. Hence, by selecting an appropriate sequence of units, it is possible to synthesize highly natural-sounding speech. The selection of the best unit s...